Setup and Import Data

import pandas as pd
import numpy as np
import seaborn as sns
np.random.seed(44)

from google.colab import drive
drive.mount("/content/gdrive")

train = pd.read_csv('/content/gdrive/My Drive/3253 Machine Learning Term Project/train.csv')
test = pd.read_csv('/content/gdrive/My Drive/3253 Machine Learning Term Project/test.csv')
print(train.shape)
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
(63999, 37)

Data Exploration

train.shape
(63999, 37)
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63999 entries, 0 to 63998
Data columns (total 37 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Customer_id                  63999 non-null  object 
 1   Target                       63999 non-null  int64  
 2   Balance                      63999 non-null  float64
 3   PreviousCampaignResult       63999 non-null  int64  
 4   Product1                     63999 non-null  int64  
 5   Product2                     63999 non-null  int64  
 6   Product3                     63999 non-null  int64  
 7   Product4                     63999 non-null  int64  
 8   Product5                     63999 non-null  int64  
 9   Product6                     63999 non-null  int64  
 10  Transaction1                 63999 non-null  float64
 11  Transaction2                 63999 non-null  float64
 12  Transaction3                 63999 non-null  float64
 13  Transaction4                 63999 non-null  float64
 14  Transaction5                 63999 non-null  float64
 15  Transaction6                 63999 non-null  float64
 16  Transaction7                 63999 non-null  float64
 17  Transaction8                 63999 non-null  float64
 18  Transaction9                 63999 non-null  float64
 19  ExternalAccount1             63999 non-null  int64  
 20  ExternalAccount2             63999 non-null  int64  
 21  ExternalAccount3             63999 non-null  int64  
 22  ExternalAccount4             63999 non-null  int64  
 23  ExternalAccount5             63999 non-null  int64  
 24  ExternalAccount6             63999 non-null  int64  
 25  ExternalAccount7             63999 non-null  int64  
 26  ActivityIndicator            63999 non-null  int64  
 27  RegularInteractionIndicator  63999 non-null  int64  
 28  CompetitiveRate1             63999 non-null  float64
 29  CompetitiveRate2             63999 non-null  float64
 30  CompetitiveRate3             63999 non-null  float64
 31  CompetitiveRate4             63999 non-null  float64
 32  CompetitiveRate5             63999 non-null  float64
 33  CompetitiveRate6             63999 non-null  float64
 34  CompetitiveRate7             63999 non-null  float64
 35  RateBefore                   63999 non-null  float64
 36  ReferenceRate                63999 non-null  float64
dtypes: float64(19), int64(17), object(1)
memory usage: 18.1+ MB
#count the unique values in "Target", which is the prediction we are going to make
train["Target"].value_counts()
0    32014
1    31985
Name: Target, dtype: int64
#Check if the data is balanced
%matplotlib inline
import matplotlib.pyplot as plt

def check_data_balance(series, style="seaborn-pastel"):
  with plt.style.context(style):
    unique = series.value_counts()
    display(unique) #show unique value counts of the target
    plt.pie(unique, explode=[0.05]*len(unique), labels=unique.index, autopct='%1.1f%%'); #plot a pie chart for the target to see if data are balanced
check_data_balance(train["Target"])
0    32014
1    31985
Name: Target, dtype: int64
#inspect the data using the dataframe's describe() function
train.describe()
Target Balance PreviousCampaignResult Product1 Product2 Product3 Product4 Product5 Product6 Transaction1 Transaction2 Transaction3 Transaction4 Transaction5 Transaction6 Transaction7 Transaction8 Transaction9 ExternalAccount1 ExternalAccount2 ExternalAccount3 ExternalAccount4 ExternalAccount5 ExternalAccount6 ExternalAccount7 ActivityIndicator RegularInteractionIndicator CompetitiveRate1 CompetitiveRate2 CompetitiveRate3 CompetitiveRate4 CompetitiveRate5 CompetitiveRate6 CompetitiveRate7 RateBefore ReferenceRate
count 63999.000000 63999.000000 63999.000000 63999.000000 63999.000000 63999.000000 63999.000000 63999.000000 63999.000000 63999.000000 63999.000000 63999.000000 63999.000000 6.399900e+04 6.399900e+04 63999.000000 63999.000000 63999.000000 63999.000000 63999.000000 63999.000000 63999.000000 63999.000000 63999.000000 63999.000000 63999.000000 63999.000000 63999.000000 63999.000000 63999.000000 63999.000000 63999.000000 63999.000000 63999.000000 63999.000000 63999.000000
mean 0.499773 22697.014218 0.001484 0.200441 0.160456 0.001766 0.092158 0.022125 0.041876 312.852441 98.306560 972.439108 13346.706243 5.872490e+03 8.831981e+03 101.191833 1522.355254 1123.682674 0.197034 0.301911 0.082533 0.123783 0.191862 0.100423 0.019313 19.776778 0.595291 1.379736 1.836435 1.391770 1.757322 1.882755 1.678434 1.412233 1.678434 1.315226
std 0.500004 10819.489043 0.039303 0.400333 0.367031 0.041983 0.289251 0.147092 0.200307 4932.061568 897.296635 5903.126316 25046.228309 2.226038e+04 3.051680e+04 894.873343 7410.544303 9570.708952 0.397762 0.459090 0.275176 0.329337 0.393768 0.300566 0.137623 104.827436 1.162643 0.206340 0.370122 0.161833 0.357483 0.516006 0.160600 0.575083 0.160600 0.106634
min 0.000000 10000.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -34.340000 -33813.000000 0.000000e+00 -2.092436e+04 -957.010000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.025000 1.425000 1.125000 1.425000 1.175000 1.425000 0.525000 1.425000 1.125000
25% 0.000000 13499.005000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.425000 1.575000 1.425000 1.575000 1.575000 1.675000 0.525000 1.675000 1.375000
50% 0.000000 19982.570000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5670.000000 0.000000e+00 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4.000000 0.000000 1.425000 1.925000 1.425000 1.675000 2.375000 1.675000 1.575000 1.675000 1.375000
75% 1.000000 30104.035000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 17500.000000 2.000000e+03 5.500000e+03 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 12.000000 1.000000 1.575000 2.375000 1.575000 1.675000 2.375000 1.875000 1.575000 1.875000 1.375000
max 1.000000 50000.000000 2.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 505000.000000 85500.000000 633506.040000 917005.000000 1.330000e+06 2.029492e+06 36186.980000 297055.190000 513627.920000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 6762.000000 56.000000 1.575000 2.375000 1.575000 2.375000 2.375000 1.875000 2.125000 1.875000 1.375000
%matplotlib inline
import matplotlib.pyplot as plt

#draw histogram of each feature
train.hist(bins=50, figsize=(20,15))

#save_fig("attribute_histogram_plots")
plt.show()
import plotly.express as px


fig = px.scatter_matrix(train.iloc[:,1:8], height=1000)
fig.show()

fig = px.scatter_matrix(train.iloc[:,8:16], height=1000)
fig.show()

fig = px.scatter_matrix(train.iloc[:,16:24], height=1000)
fig.show()

fig = px.scatter_matrix(train.iloc[:,24:32], height=1000)
fig.show()

fig = px.scatter_matrix(train.iloc[:,32:37], height=1000)
fig.show()
len(train.columns)
37
# Check PreviousCampaignResult feature values - almost all of them are zero

train['PreviousCampaignResult'].value_counts()
0    63906
1       91
2        2
Name: PreviousCampaignResult, dtype: int64
# Check values for Product features

products = train[['Product1', 'Product2', 'Product3', 'Product4', 'Product5', 'Product6']]
#products

products.apply(pd.Series.value_counts)
Product1 Product2 Product3 Product4 Product5 Product6
0 51171 53730 63886 58101 62583 61319
1 12828 10269 113 5898 1416 2680
# Products value counts in % contribution

products.apply(lambda x: pd.value_counts(x, normalize=True).mul(100).round(1).astype(str) + '%')
Product1 Product2 Product3 Product4 Product5 Product6
0 80.0% 84.0% 99.8% 90.8% 97.8% 95.8%
1 20.0% 16.0% 0.2% 9.2% 2.2% 4.2%
# Box plots for Transaction features

plt.subplot(331)
sns.boxplot(train["Transaction1"])

plt.subplot(332)
sns.boxplot(train["Transaction2"])

plt.subplot(333)
sns.boxplot(train["Transaction3"])

plt.subplot(334)
sns.boxplot(train["Transaction4"])

plt.subplot(335)
sns.boxplot(train["Transaction5"])

plt.subplot(336)
sns.boxplot(train["Transaction6"])

plt.subplot(337)
sns.boxplot(train["Transaction7"])

plt.subplot(338)
sns.boxplot(train["Transaction8"])

plt.subplot(339)
sns.boxplot(train["Transaction9"])

fig = plt.gcf()
fig.set_size_inches(10,10)
/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning:

Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.

/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning:

Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.

/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning:

Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.

/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning:

Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.

/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning:

Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.

/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning:

Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.

/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning:

Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.

/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning:

Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.

/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning:

Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.

# Count non-zero values in each Transaction column

transactions = train[['Transaction1', 'Transaction2', 'Transaction3', 'Transaction4', 'Transaction5', 'Transaction6', 'Transaction7', 
                      'Transaction8', 'Transaction9']]
# transactions

transactions.apply(lambda x: np.count_nonzero(x))
Transaction1     1856
Transaction2     3651
Transaction3     4895
Transaction4    46538
Transaction5    19515
Transaction6    25255
Transaction7     3033
Transaction8     4672
Transaction9     6221
dtype: int64
# Check values for External Account features

external_accounts = train[['ExternalAccount1', 'ExternalAccount2', 'ExternalAccount3', 'ExternalAccount4', 'ExternalAccount5', 'ExternalAccount6', 'ExternalAccount7']]
# external_accounts

external_accounts.apply(lambda x: pd.value_counts(x, normalize=True).mul(100).round(1).astype(str) + '%')
ExternalAccount1 ExternalAccount2 ExternalAccount3 ExternalAccount4 ExternalAccount5 ExternalAccount6 ExternalAccount7
0 80.3% 69.8% 91.7% 87.6% 80.8% 90.0% 98.1%
1 19.7% 30.2% 8.3% 12.4% 19.2% 10.0% 1.9%
# Check values for Activity Indicator feature

sns.boxplot(train['ActivityIndicator'])
/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning:

Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.

<matplotlib.axes._subplots.AxesSubplot at 0x7eff8660bfd0>
# Group values into buckets by percentile

p50 = np.percentile(train["ActivityIndicator"],50)
p75 = np.percentile(train["ActivityIndicator"],75)
p99 = np.percentile(train["ActivityIndicator"],99)
bins = [-1, 0, p50, p75, p99, np.inf]
train["ActivityIndicator"].value_counts(bins=bins, sort=False, normalize=True).mul(100).round(1).astype(str) + '%'
(-1.001, 0.0]    31.5%
(0.0, 4.0]       24.5%
(4.0, 12.0]      21.2%
(12.0, 316.0]    21.7%
(316.0, inf]      1.0%
Name: ActivityIndicator, dtype: object
# Check target variable for customers with no activity

inactive = train[train['ActivityIndicator'] == 0]
inactive["Target"].value_counts()
0    12636
1     7547
Name: Target, dtype: int64
inactive["Target"].value_counts(normalize=True).mul(100).round(1).astype(str) + '%'
0    62.6%
1    37.4%
Name: Target, dtype: object
# Check target variable for customers with some activity

active = train[train['ActivityIndicator']  != 0]
active["Target"].value_counts()
1    24438
0    19378
Name: Target, dtype: int64
active["Target"].value_counts(normalize=True).mul(100).round(1).astype(str) + '%'
1    55.8%
0    44.2%
Name: Target, dtype: object
# Check values for Regular Interaction Indicator feature

sns.countplot(train["RegularInteractionIndicator"])
/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning:

Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.

<matplotlib.axes._subplots.AxesSubplot at 0x7eff84c4fd10>
# Group values into buckets by percentile

p50 = np.percentile(train["RegularInteractionIndicator"],50)
p75 = np.percentile(train["RegularInteractionIndicator"],75)
p99 = np.percentile(train["RegularInteractionIndicator"],99)
bins = [-1, p50, p75, p99, np.inf]
train["RegularInteractionIndicator"].value_counts(bins=bins, sort=False, normalize=True).mul(100).round(1).astype(str) + '%'
(-1.001, 0.0]    63.1%
(0.0, 1.0]       25.3%
(1.0, 5.0]       10.8%
(5.0, inf]        0.8%
Name: RegularInteractionIndicator, dtype: object
# Check target variable for customers with zero interaction frequency score

infrequent = train[train['RegularInteractionIndicator'] == 0]
infrequent["Target"].value_counts(normalize=True).mul(100).round(1).astype(str) + '%'
0    50.7%
1    49.3%
Name: Target, dtype: object
# Summarize rate offers

rates = train[['CompetitiveRate1', 'CompetitiveRate2', 'CompetitiveRate4', 'CompetitiveRate4', 'CompetitiveRate5', 'CompetitiveRate6', 'CompetitiveRate7', 
               'RateBefore', 'ReferenceRate']]
rates.apply(lambda x: pd.value_counts(x, normalize=True).mul(100).round(1).astype(str) + '%')
CompetitiveRate1 CompetitiveRate2 CompetitiveRate4 CompetitiveRate4 CompetitiveRate5 CompetitiveRate6 CompetitiveRate7 RateBefore ReferenceRate
0.525 NaN NaN NaN NaN NaN NaN 25.7% NaN NaN
1.025 23.9% NaN NaN NaN NaN NaN NaN NaN NaN
1.125 NaN NaN NaN NaN NaN NaN NaN NaN 23.9%
1.175 NaN NaN NaN NaN 23.9% NaN NaN NaN NaN
1.375 NaN NaN NaN NaN NaN NaN NaN NaN 76.1%
1.425 26.7% 23.8% 23.8% 23.8% NaN 23.9% NaN 23.9% NaN
1.475 23.8% NaN NaN NaN NaN NaN 23.9% NaN NaN
1.575 25.7% 25.7% 25.7% 25.7% 25.7% NaN 26.7% NaN NaN
1.675 NaN NaN 26.7% 26.7% NaN 26.7% NaN 26.7% NaN
1.725 NaN NaN NaN NaN NaN 23.8% NaN 23.8% NaN
1.875 NaN NaN NaN NaN NaN 25.7% NaN 25.7% NaN
1.925 NaN 23.9% NaN NaN NaN NaN NaN NaN NaN
2.125 NaN NaN NaN NaN NaN NaN 23.8% NaN NaN
2.375 NaN 26.7% 23.9% 23.9% 50.4% NaN NaN NaN NaN
# Check if there's correlation between balance and activity/interaction indicators

train_corr = train[['Balance', 'ActivityIndicator',	'RegularInteractionIndicator']].copy()
# sns.pairplot(train_corr)
sns.heatmap(train_corr.corr(),
            annot=True,
            linewidth=.5,
            center = 0,
            fmt='.1g',
            cbar=False,
            cmap='GnBu')
<matplotlib.axes._subplots.AxesSubplot at 0x7eff9a05f950>
 

Observations:

  • Training dataset contains 63,999 instances and 37 features
  • Dataset doesn't have null values
  • All features are numerical (except for Customer ID)
  • Target variable is balanced, i.e. equally divided between positive and negative label
  • Balance variable is not normally distributed, it takes values between 10,000 and 50,000 with median of around 20,000
  • Almost all values of Previous Campaign Result variable are zero. Since most values are the same, it could potentially be excluded from the model
  • Product 3 is only used by 0.2% of customers
  • 31.5% of customers did not perform any activity with the bank, and 62.6% of them did not respond to previous campaigns (i.e. target = 0)
  • 63.1% of customers had zero interaction frequency score, however target variable for these customers was distributed evenly
  • Majority of customers had the rate of 1.375% after negotiations
  • There seems to be no correlation between account balance and number/frequency of customer activity